library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
── Attaching packages ──────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.3 ✓ purrr 0.3.4
✓ tibble 3.0.4 ✓ dplyr 1.0.2
✓ tidyr 1.1.2 ✓ stringr 1.4.0
✓ readr 1.4.0 ✓ forcats 0.5.0
package ‘ggplot2’ was built under R version 3.6.2package ‘tibble’ was built under R version 3.6.2package ‘tidyr’ was built under R version 3.6.2package ‘readr’ was built under R version 3.6.2package ‘purrr’ was built under R version 3.6.2package ‘dplyr’ was built under R version 3.6.2── Conflicts ─────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
house <- read_csv("data/kc_house_data.csv")
── Column specification ─────────────────────────────────────────────────────────
cols(
.default = col_double(),
id = col_character(),
date = col_datetime(format = "")
)
ℹ Use `spec()` for the full column specifications.
house
house_tidy <- house %>%
select(-c(id, date, sqft_living15, sqft_lot15, zipcode)) %>%
mutate(waterfront = as.logical(waterfront)) %>%
mutate(renovated = ifelse(yr_renovated > 0, TRUE, FALSE)) %>%
select(-yr_renovated) %>%
mutate(condition = as_factor(condition)) %>%
mutate(grade = as_factor(grade))
house_tidy
Check for aliased variables using the alias() function (this takes in a formula object and a data set). [Hint - formula price ~ . says ‘price varying with all predictors’, this is a suitable input to alias()]. Remove variables that lead to an alias. Check the ‘Elements of multiple regression’ lesson for a dropdown containing further information on finding aliased variables in a dataset.
alias(price ~., house_tidy)
Model :
price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
waterfront + view + condition + grade + sqft_above + sqft_basement +
yr_built + lat + long + renovated
Complete :
(Intercept) bedrooms bathrooms sqft_living sqft_lot floors
sqft_basement 0 0 0 1 0 0
waterfrontTRUE view condition2 condition3 condition4 condition5
sqft_basement 0 0 0 0 0 0
grade3 grade4 grade5 grade6 grade7 grade8 grade9 grade10 grade11
sqft_basement 0 0 0 0 0 0 0 0 0
grade12 grade13 sqft_above yr_built lat long renovatedTRUE
sqft_basement 0 0 -1 0 0 0 0
house_data <- house_tidy %>%
select(-sqft_living)
alias(price ~., house_data)
Model :
price ~ bedrooms + bathrooms + sqft_lot + floors + waterfront +
view + condition + grade + sqft_above + sqft_basement + yr_built +
lat + long + renovated
library(GGally)
package ‘GGally’ was built under R version 3.6.2Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
houses_tidy_numeric <- house_data %>%
select_if(is.numeric)
houses_tidy_nonnumeric <- house_data %>%
select_if(function(x) !is.numeric(x))
houses_tidy_nonnumeric$price <- house_data$price
ggpairs(houses_tidy_numeric, progress = FALSE)
ggpairs(houses_tidy_nonnumeric, progress = FALSE)
model1 <- lm(price ~ grade, data = house_data)
summary(model1)
Call:
lm(formula = price ~ grade, data = house_data)
Residuals:
Min 1Q Median 3Q Max
-1929615 -135853 -35090 89080 5565658
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 142000 254499 0.558 0.576878
grade3 63667 293870 0.217 0.828484
grade4 72381 258849 0.280 0.779767
grade5 106524 255024 0.418 0.676169
grade6 159920 254561 0.628 0.529868
grade7 260590 254513 1.024 0.305904
grade8 400853 254520 1.575 0.115285
grade9 631513 254547 2.481 0.013112 *
grade10 929771 254611 3.652 0.000261 ***
grade11 1354842 254817 5.317 1.07e-07 ***
grade12 2049222 255909 8.008 1.23e-15 ***
grade13 3567615 264106 13.508 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 254500 on 21601 degrees of freedom
Multiple R-squared: 0.5197, Adjusted R-squared: 0.5195
F-statistic: 2125 on 11 and 21601 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model1)
not plotting observations with leverage one:
19453
null_model <- lm(price ~ 1, data = house_data)
grade_model <- lm(price ~ grade, data = house_data)
anova(null_model, grade_model)
Analysis of Variance Table
Model 1: price ~ 1
Model 2: price ~ grade
Res.Df RSS Df Sum of Sq F Pr(>F)
1 21612 2.9129e+15
2 21601 1.3991e+15 11 1.5138e+15 2124.8 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
library(modelr)
package ‘modelr’ was built under R version 3.6.2
price_remaining_resid <- house_data %>%
add_residuals(model1) %>%
select(-c(price, grade))
houses_resid_numeric <- price_remaining_resid %>%
select_if(is.numeric)
houses_resid_nonnumeric <- price_remaining_resid %>%
select_if(function(x) !is.numeric(x))
houses_resid_nonnumeric$resid <- price_remaining_resid$resid
houses_resid_nonnumeric %>%
ggpairs(aes( alpha = 0.5), progress = FALSE)
houses_resid_numeric %>%
ggpairs(aes( alpha = 0.5), progress = FALSE)
model2 <- lm(price ~ grade + lat, data = house_data)
summary(model2)
Call:
lm(formula = price ~ grade + lat, data = house_data)
Residuals:
Min 1Q Median 3Q Max
-1863783 -112379 -28181 67454 5533910
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -29949769 610666 -49.044 < 2e-16 ***
grade3 187923 276126 0.681 0.496151
grade4 95367 243212 0.392 0.694978
grade5 126354 239618 0.527 0.597980
grade6 158453 239183 0.662 0.507673
grade7 246275 239138 1.030 0.303093
grade8 378635 239144 1.583 0.113369
grade9 603010 239170 2.521 0.011701 *
grade10 891752 239230 3.728 0.000194 ***
grade11 1311124 239425 5.476 4.4e-08 ***
grade12 2007093 240450 8.347 < 2e-16 ***
grade13 3502099 248154 14.113 < 2e-16 ***
lat 633100 11822 53.554 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 239100 on 21600 degrees of freedom
Multiple R-squared: 0.576, Adjusted R-squared: 0.5758
F-statistic: 2445 on 12 and 21600 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model2)
not plotting observations with leverage one:
19453
price_remaining_resid <- house_data%>%
add_residuals(model2) %>%
select(-c(price, grade, lat))
houses_resid_numeric <- price_remaining_resid %>%
select_if(is.numeric)
houses_resid_nonnumeric <- price_remaining_resid %>%
select_if(function(x) !is.numeric(x))
houses_resid_nonnumeric$resid <- price_remaining_resid$resid
houses_resid_nonnumeric %>%
ggpairs(aes( alpha = 0.5), progress = FALSE)
houses_resid_numeric %>%
ggpairs(aes( alpha = 0.5), progress = FALSE)
model3 <- lm(price ~ grade + lat + view, data = house_data)
summary(model3)
Call:
lm(formula = price ~ grade + lat + view, data = house_data)
Residuals:
Min 1Q Median 3Q Max
-1665265 -105866 -21393 69623 5429667
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -30514024 576750 -52.907 < 2e-16 ***
grade3 190253 260743 0.730 0.465607
grade4 81058 229663 0.353 0.724133
grade5 112154 226269 0.496 0.620134
grade6 148515 225858 0.658 0.510827
grade7 235345 225815 1.042 0.297328
grade8 351873 225822 1.558 0.119203
grade9 556584 225848 2.464 0.013731 *
grade10 821118 225907 3.635 0.000279 ***
grade11 1200228 226097 5.308 1.12e-07 ***
grade12 1832950 227080 8.072 7.28e-16 ***
grade13 3303587 234361 14.096 < 2e-16 ***
lat 644972 11166 57.764 < 2e-16 ***
view 106862 2086 51.234 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 225800 on 21599 degrees of freedom
Multiple R-squared: 0.6219, Adjusted R-squared: 0.6217
F-statistic: 2733 on 13 and 21599 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model3)
not plotting observations with leverage one:
19453
anova(model2, model3)
Analysis of Variance Table
Model 1: price ~ grade + lat
Model 2: price ~ grade + lat + view
Res.Df RSS Df Sum of Sq F Pr(>F)
1 21600 1.2351e+15
2 21599 1.1013e+15 1 1.3383e+14 2624.9 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
price_remaining_resid <- house_data %>%
add_residuals(model3) %>%
select(-c(price, grade, lat, view))
houses_resid_numeric <- price_remaining_resid %>%
select_if(is.numeric)
houses_resid_nonnumeric <- price_remaining_resid %>%
select_if(function(x) !is.numeric(x))
houses_resid_nonnumeric$resid <- price_remaining_resid$resid
houses_resid_nonnumeric %>%
ggpairs(aes( alpha = 0.5), progress = FALSE)
houses_resid_numeric %>%
ggpairs(aes( alpha = 0.5), progress = FALSE)
price_resid <- house_data %>%
add_residuals(model3) %>%
select(-price)
price_resid %>%
ggplot(aes(x = view, y = resid, color = grade)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
price_resid %>%
ggplot(aes(x = sqft_basement, y = resid, color = grade)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
model4 <- lm(price ~ grade + lat + view + sqft_basement + view:sqft_basement, data = house_data)
summary(model4)
Call:
lm(formula = price ~ grade + lat + view + sqft_basement + view:sqft_basement,
data = house_data)
Residuals:
Min 1Q Median 3Q Max
-1638784 -104221 -19570 69709 5166209
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.883e+07 5.636e+05 -51.155 < 2e-16 ***
grade3 1.833e+05 2.537e+05 0.723 0.469931
grade4 8.399e+04 2.234e+05 0.376 0.707022
grade5 1.118e+05 2.201e+05 0.508 0.611432
grade6 1.385e+05 2.197e+05 0.630 0.528579
grade7 2.087e+05 2.197e+05 0.950 0.342139
grade8 3.238e+05 2.197e+05 1.474 0.140599
grade9 5.312e+05 2.197e+05 2.417 0.015645 *
grade10 7.884e+05 2.198e+05 3.587 0.000335 ***
grade11 1.155e+06 2.200e+05 5.250 1.54e-07 ***
grade12 1.742e+06 2.210e+05 7.886 3.27e-15 ***
grade13 3.138e+06 2.281e+05 13.756 < 2e-16 ***
lat 6.096e+05 1.092e+04 55.822 < 2e-16 ***
view 7.106e+04 2.847e+03 24.962 < 2e-16 ***
sqft_basement 1.056e+02 3.910e+00 27.003 < 2e-16 ***
view:sqft_basement 2.870e+01 3.073e+00 9.339 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 219700 on 21597 degrees of freedom
Multiple R-squared: 0.6422, Adjusted R-squared: 0.6419
F-statistic: 2584 on 15 and 21597 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model4)
not plotting observations with leverage one:
19453
model5 <- lm(price ~ grade + lat + view + sqft_basement + lat:sqft_basement, data = house_data)
summary(model5)
Call:
lm(formula = price ~ grade + lat + view + sqft_basement + lat:sqft_basement,
data = house_data)
Residuals:
Min 1Q Median 3Q Max
-1550147 -101737 -18763 68303 5174888
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.488e+07 6.359e+05 -39.124 < 2e-16 ***
grade3 1.670e+05 2.532e+05 0.659 0.509600
grade4 7.862e+04 2.230e+05 0.353 0.724442
grade5 1.072e+05 2.197e+05 0.488 0.625492
grade6 1.362e+05 2.193e+05 0.621 0.534705
grade7 2.060e+05 2.193e+05 0.939 0.347638
grade8 3.210e+05 2.193e+05 1.464 0.143228
grade9 5.299e+05 2.193e+05 2.416 0.015700 *
grade10 7.885e+05 2.194e+05 3.594 0.000326 ***
grade11 1.158e+06 2.196e+05 5.274 1.35e-07 ***
grade12 1.765e+06 2.205e+05 8.004 1.26e-15 ***
grade13 3.143e+06 2.276e+05 13.809 < 2e-16 ***
lat 5.264e+05 1.256e+04 41.918 < 2e-16 ***
view 8.911e+04 2.093e+03 42.569 < 2e-16 ***
sqft_basement -1.698e+04 1.310e+03 -12.968 < 2e-16 ***
lat:sqft_basement 3.595e+02 2.752e+01 13.060 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 219300 on 21597 degrees of freedom
Multiple R-squared: 0.6435, Adjusted R-squared: 0.6433
F-statistic: 2599 on 15 and 21597 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(model5)
not plotting observations with leverage one:
19453
library(relaimpo)
Loading required package: MASS
package ‘MASS’ was built under R version 3.6.2
Attaching package: ‘MASS’
The following object is masked from ‘package:dplyr’:
select
Loading required package: boot
Loading required package: survey
package ‘survey’ was built under R version 3.6.2Loading required package: grid
Loading required package: Matrix
Attaching package: ‘Matrix’
The following objects are masked from ‘package:tidyr’:
expand, pack, unpack
Loading required package: survival
Attaching package: ‘survival’
The following object is masked from ‘package:boot’:
aml
Attaching package: ‘survey’
The following object is masked from ‘package:graphics’:
dotchart
Loading required package: mitools
This is the global version of package relaimpo.
If you are a non-US user, a version with the interesting additional metric pmvd is available
from Ulrike Groempings web site at prof.beuth-hochschule.de/groemping.
calc.relimp(model4, type = "lmg", rela = TRUE)
Response variable: price
Total response variance: 134782378397
Analysis based on 21613 observations
15 Regressors:
Some regressors combined in groups:
Group grade : grade3 grade4 grade5 grade6 grade7 grade8 grade9 grade10 grade11 grade12 grade13
Relative importance of 5 (groups of) regressors assessed:
grade lat view sqft_basement view:sqft_basement
Proportion of variance explained by model: 64.22%
Metrics are normalized to sum to 100% (rela=TRUE).
Relative importance metrics:
lmg
grade 0.641619393
lat 0.108144482
view 0.152406200
sqft_basement 0.092231176
view:sqft_basement 0.005598749
Average coefficients for different model sizes:
1group 2groups 3groups 4groups
grade3 63666.6672 105085.5034 1.442515e+05 1.233187e+05
grade4 72381.0351 74840.3850 7.783802e+04 7.176178e+04
grade5 106523.9716 106120.0182 1.068609e+05 1.005730e+05
grade6 159919.6380 148997.5902 1.408212e+05 1.365309e+05
grade7 260590.2629 235879.0272 2.169625e+05 2.105291e+05
grade8 400852.7662 366263.3497 3.391397e+05 3.295025e+05
grade9 631513.1864 588676.4480 5.544622e+05 5.412054e+05
grade10 929771.0746 870400.5336 8.228434e+05 8.029513e+05
grade11 1354841.7274 1272639.8932 1.206981e+06 1.173765e+06
grade12 2049222.0006 1930531.4478 1.836852e+06 1.771388e+06
grade13 3567615.3852 3398156.8184 3.266023e+06 3.177075e+06
lat 813411.5832 722508.3996 6.605253e+05 6.805225e+05
view 190335.2479 151137.4701 1.164405e+05 9.016975e+04
sqft_basement 268.6136 203.8056 1.542690e+02 1.239932e+02
view:sqft_basement NaN NaN 7.385838e+01 5.110896e+01
5groups
grade3 1.833136e+05
grade4 8.398524e+04
grade5 1.118412e+05
grade6 1.384799e+05
grade7 2.087124e+05
grade8 3.237704e+05
grade9 5.311656e+05
grade10 7.884248e+05
grade11 1.154893e+06
grade12 1.742429e+06
grade13 3.137509e+06
lat 6.096140e+05
view 7.105637e+04
sqft_basement 1.055758e+02
view:sqft_basement 2.869929e+01
price_resid %>%
ggplot(aes(x = sqft_basement, y = resid, colour = grade)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ grade)